[XPU] Add swap_cache_layout op to support Mooncake KV cache for XPU. #7728
Jiajun-Ji wants to merge 3 commits into
Conversation
Thanks for your contribution!
Pull request overview
This PR adds a swap_cache_layout custom op on the XPU path to convert layouts and copy data between the XPU KV cache (stored per layer) and a CPU pinned buffer (stored block-major, layer-minor), enabling Mooncake as a KV cache storage backend on XPU.
Changes:
- The Mooncake configuration now auto-detects and fills in RDMA NICs by default on both CUDA and XPU platforms.
- The cache_manager XPU ops module exports swap_cache_layout for use by the storage read/write paths.
- A new XPU custom op implementation of swap_cache_layout is added, together with its test cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| fastdeploy/cache_manager/transfer_factory/mooncake_store/mooncake_store.py | Enables automatic RDMA device selection on XPU as well |
| fastdeploy/cache_manager/ops.py | Imports and exposes swap_cache_layout on the XPU platform |
| custom_ops/xpu_ops/src/ops/swap_cache_layout.cc | New XPU swap_cache_layout op implementation (XPU↔CPU pinned buffer layout-converting copy) |
| custom_ops/xpu_ops/test/test_swap_cache_layout.py | New roundtrip/performance tests for swap_cache_layout |
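To make the two layouts concrete, here is a minimal pure-Python model of the op's semantics as described above. Flat lists stand in for the device caches and the pinned buffer; the function name and the explicit block-stride parameter are illustrative only, not the actual C++ signature:

```python
def swap_cache_layout_ref(xpu_caches, cpu_buffer, xpu_block_ids, cpu_block_ids,
                          layer_num, block_stride, mode):
    """Reference layout swap.

    xpu_caches:  one flat list per layer, each block_num * block_stride long
                 (per-layer layout [block_num, head_num, block_size, head_dim]).
    cpu_buffer:  one flat list, block-major / layer-minor: block b, layer l
                 starts at (b * layer_num + l) * block_stride.
    mode:        0 = XPU -> CPU, 1 = CPU -> XPU.
    """
    assert len(xpu_block_ids) == len(cpu_block_ids)
    for layer_idx, layer_cache in enumerate(xpu_caches):
        for xb, cb in zip(xpu_block_ids, cpu_block_ids):
            src = xb * block_stride
            dst = (cb * layer_num + layer_idx) * block_stride
            if mode == 0:  # XPU cache -> CPU pinned buffer
                cpu_buffer[dst:dst + block_stride] = layer_cache[src:src + block_stride]
            else:          # CPU pinned buffer -> XPU cache
                layer_cache[src:src + block_stride] = cpu_buffer[dst:dst + block_stride]
```

A roundtrip (mode 0 followed by mode 1 with the same block ids) leaves the selected blocks unchanged, which mirrors what the test file's roundtrip case checks on real hardware.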
```cpp
auto* cache_cpu_ptr = reinterpret_cast<T*>(cache_cpu_pointer);

for (int block_idx = 0; block_idx < static_cast<int>(xpu_block_ids.size());
     block_idx++) {
  auto cur_xpu_block_id = xpu_block_ids[block_idx];
```

```cpp
for (int i = 1; i < static_cast<int>(cache_shape.size()); i++) {
  cache_block_stride *= cache_shape[i];
}
```
```python
        mode,
    )
    paddle.device.synchronize()
    cost_time = time.time() - start
    print(
        f"swap cache layout ({label}), total_gb: {total_gb:.6f}GB, "
        f"cost_time: {cost_time:.6f}s, speed: {total_gb / cost_time:.6f}GB/s"
    )
```

```python
    def test_performance(self):
        for _ in range(3):
```
CI report generated from the code below (refreshed every 30 minutes):
1 Task overview: this PR has 1 Required task failing and 1 Required task still running; merging is temporarily blocked.
2 Task status summary
2.1 Required tasks: 8/10 passed
2.2 Optional tasks: 27/31 passed
3 Failure details (Required only): Approval — approval workflow (confidence: high)
Fix suggestion summary: invite one designated FastDeploy RD and one PaddlePaddle RD to approve.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-07 11:13:41
📋 Review Summary
PR overview: adds a swap_cache_layout custom op for the XPU platform, implementing the layout conversion between the XPU KV cache and CPU pinned memory, and extends mooncake_store's automatic RDMA NIC detection to the XPU platform
Change scope: custom_ops/xpu_ops/src/ops/, fastdeploy/cache_manager/
Impact tags: [XPU] [KVCache] [OP]
📝 PR Convention Check
The ## Modifications section is empty (only the template placeholder comment remains), and none of the Checklist items are ticked to reflect the actual state of the PR.
Suggested title (copy-paste ready):
[XPU][KVCache] Add swap_cache_layout op to support Mooncake KV cache for XPU
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
Add the swap_cache_layout op to support mooncake; it is called in _run_read_storage to perform the layout swap between the XPU KV cache and CPU pinned memory. mooncake does not natively support XPU, so the mooncake backend is connected through CPU memory.
## Modifications
- Add `custom_ops/xpu_ops/src/ops/swap_cache_layout.cc`: implements the data-movement op between the XPU KV cache (layout: `[block_num, head_num, block_size, head_dim]`) and CPU pinned memory (layout: `[block_num, layer_num, head_num, block_size, head_dim]`), supporting both directions: mode=0 (XPU→CPU) and mode=1 (CPU→XPU)
- Add `custom_ops/xpu_ops/test/test_swap_cache_layout.py`: covers roundtrip correctness verification and XPU↔CPU bandwidth performance tests
- `fastdeploy/cache_manager/ops.py`: fixes `swap_cache_layout` being wrongly set to `None` on the XPU platform; it is now correctly imported from `fastdeploy.model_executor.ops.xpu`
- `fastdeploy/cache_manager/transfer_factory/mooncake_store/mooncake_store.py`: extends the RDMA NIC auto-detection logic (`get_rdma_nics()`) to the XPU platform
## Usage or Command
See the shell launch script in the PR description (start the Mooncake Master + two XPU instances + verification requests).
## Accuracy Tests
Instance 1 writes to mooncake; instance 2 hits the mooncake cache (screenshots are provided in the PR description).
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/xpu_ops/src/ops/swap_cache_layout.cc:74 | xpu_memcpy is called synchronously on each iteration of the layer×block double loop, producing many serialized XDMA calls for large models |
Overall assessment
The implementation approach is clear; it fixes the legacy issue where swap_cache_layout was wrongly set to None on the XPU platform, and the feature is complete with solid roundtrip and performance test coverage. Only the ## Modifications section is empty; please fill it in for traceability.
```cpp
void* src = (mode == 0) ? static_cast<void*>(xpu_ptr_now)
                        : static_cast<void*>(cpu_ptr_now);

int ret = xpu_memcpy(dst, src, cache_block_stride * sizeof(T), copy_kind);
```
🟡 Suggestion: xpu_memcpy is invoked synchronously on each iteration of the layer_num × block_num double loop.
For large models (32+ layers, many blocks), this produces a large number of serialized XDMA calls and may become a throughput bottleneck. Consider checking whether the XPU runtime supports streamed/asynchronous memcpy batch submission, or batching the transfers of multiple blocks within the same layer to increase concurrency.
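One direction this suggestion points at, sketched in a pure-Python buffer model (an illustration of the data layout only; no staging or async-copy API of the real XPU runtime is assumed here): because the CPU buffer is block-major and layer-minor, all layer_num slices of a given block are contiguous on the CPU side, so gathering a block's layers into one contiguous staging area would allow a single transfer per block instead of layer_num small transfers.

```python
def gather_block(xpu_caches, xpu_block_id, block_stride):
    # Collect the per-layer slices of one block into one contiguous buffer,
    # in layer order -- the same order the CPU pinned buffer stores them.
    staging = []
    for layer_cache in xpu_caches:
        off = xpu_block_id * block_stride
        staging.extend(layer_cache[off:off + block_stride])
    return staging

def write_block(cpu_buffer, cpu_block_id, layer_num, block_stride, staging):
    # One contiguous write of layer_num * block_stride elements per block,
    # replacing layer_num separate small copies.
    off = cpu_block_id * layer_num * block_stride
    cpu_buffer[off:off + layer_num * block_stride] = staging
```

The staged path produces byte-identical CPU buffer contents while cutting the number of bulk transfers from layer_num × block_num down to block_num.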
Codecov Report
❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #7728 +/- ##
==========================================
Coverage ? 63.44%
==========================================
Files ? 461
Lines ? 64129
Branches ? 9824
==========================================
Hits ? 40687
Misses ? 20644
Partials ? 2798
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
PaddlePaddle-bot
left a comment
🤖 Paddle-CI-Agent | pr_review |
2026-05-13 15:35:30
📋 Review Summary
PR overview: adds a swap_cache_layout custom op for XPU to support cross-instance Mooncake KV cache transfer
Change scope: custom_ops/xpu_ops/, fastdeploy/cache_manager/
Impact tags: [XPU] [KVCache] [OP] [PD Disaggregation]
📝 PR Convention Check
The title carries the official tag [XPU] and is correctly formatted. The ## Modifications section is empty and needs concrete change descriptions.
Suggested PR description (copy-paste ready; it must reproduce the full structure of the checklist §D2 template):
## Motivation
mooncake does not natively support XPU, so CPU memory is used as a staging area to connect to the mooncake backend. This PR adds a `swap_cache_layout` op for XPU that moves data between the XPU KV cache (layout: `[block_num, head_num, block_size, head_dim]`) and CPU pinned memory (layout: `[block_num, layer_num, head_num, block_size, head_dim]`); it is called in `_run_read_storage` to swap the XPU KV cache to CPU pinned memory, enabling cross-instance Mooncake KV cache sharing.
## Modifications
- `custom_ops/xpu_ops/src/ops/swap_cache_layout.cc`: adds the `swap_cache_layout` XPU custom op, copying layer by layer and block by block via `xpu_memcpy`, supporting both directions: mode 0 (XPU→CPU) and mode 1 (CPU→XPU)
- `custom_ops/xpu_ops/test/test_swap_cache_layout.py`: adds unit tests for the op, covering host alloc/free, roundtrip correctness, and D2H/H2D performance
- `fastdeploy/cache_manager/ops.py`: the XPU branch now actually imports `swap_cache_layout` from `fastdeploy.model_executor.ops.xpu` (fixing the earlier hardcoded `None`)
- `fastdeploy/cache_manager/transfer_factory/mooncake_store/mooncake_store.py`: adds a `current_platform.is_xpu()` check to the RDMA NIC auto-selection condition, aligning behavior with the CUDA platform
## Usage or Command
See the launch script in the PR description (start the Mooncake Master plus two XPU instances, then send test requests to verify the cross-instance KV cache hit).
## Accuracy Tests
cache_manager.log screenshots are provided: instance 1 writes to Mooncake, and instance 2 successfully receives and hits the KV cache.
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [x] Provide accuracy results.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 🟡 Suggestion | custom_ops/xpu_ops/src/ops/swap_cache_layout.cc:93 | Missing size-consistency check between gpu_block_ids and cpu_block_ids; if the lengths differ, the inner loop's access to cpu_block_ids[block_idx] triggers UB |
| ❓ Question | custom_ops/xpu_ops/test/test_swap_cache_layout.py:34 | The TestAllocCachePinned docstring mentions xpu_host_alloc/xpu_host_free, but the code actually imports and calls cuda_host_alloc/cuda_host_free; the comment is wrong |
Overall assessment
The overall implementation approach is clear, the op design is consistent with the existing XPU ops (PD_BUILD_OP, automatic scan-and-build), and the mooncake adaptation path is reasonable. Please add a size-consistency check between gpu_block_ids and cpu_block_ids to prevent UB, fix the incorrect comment in the test, and fill in the ## Modifications section; the PR can then be merged.
```cpp
xpu_set_device(rank);  // used for distributed launch
PD_CHECK(cache_xpu_tensors.size() > 0, "cache_xpu_tensors must not be empty");

switch (cache_xpu_tensors[0].dtype()) {
```
🟡 Suggestion: missing size-consistency check between gpu_block_ids and cpu_block_ids
The inner loop (line 61) iterates up to xpu_block_ids.size() and directly accesses cpu_block_ids[block_idx]. If the caller passes lists of different lengths, this reads out of bounds (C++ UB).
Consider adding, right after the existing PD_CHECK at the entry of SwapCacheLayout:

```cpp
PD_CHECK(gpu_block_ids.size() == cpu_block_ids.size(),
         "gpu_block_ids and cpu_block_ids must have the same size, got %zu vs %zu",
         gpu_block_ids.size(), cpu_block_ids.size());
```
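The same fail-fast guards could also live in a thin Python wrapper before dispatching to the op. The wrapper name and signature below are hypothetical, for illustration only:

```python
def swap_cache_layout_checked(cache_xpu_tensors, cache_cpu_pointer,
                              gpu_block_ids, cpu_block_ids, mode):
    # Validate on the host side so a mismatch fails with a clear error
    # instead of reading out of bounds inside the C++ loop.
    if not cache_xpu_tensors:
        raise ValueError("cache_xpu_tensors must not be empty")
    if len(gpu_block_ids) != len(cpu_block_ids):
        raise ValueError(
            "gpu_block_ids and cpu_block_ids must have the same size, "
            f"got {len(gpu_block_ids)} vs {len(cpu_block_ids)}"
        )
    if mode not in (0, 1):
        raise ValueError(f"mode must be 0 (XPU->CPU) or 1 (CPU->XPU), got {mode}")
    # ... dispatch to the actual custom op here ...
```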
Motivation
Add the swap_cache_layout op to support mooncake; it is called in _run_read_storage to swap the XPU KV cache to CPU pinned memory.
mooncake does not natively support XPU, so the mooncake backend is connected through CPU memory.
Modifications
Usage or Command
Accuracy Tests
config


cache_manager.log
Instance 1 writes to mooncake
(screenshot)
Instance 2 receives from mooncake
(screenshot)
Checklist
Tag list: [FDConfig], [APIServer], [Engine], [Scheduler], [PD Disaggregation], [Executor], [Graph Optimization], [Speculative Decoding], [RL], [Models], [Quantization], [Loader], [OP], [KVCache], [DataProcessor], [BugFix], [Docs], [CI], [Optimization], [Feature], [Benchmark], [Others], [XPU], [HPU], [GCU], [DCU], [Iluvatar], [Metax]
Format your code, run `pre-commit` before commit.
If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.